34 research outputs found
An Attention-driven Hierarchical Multi-scale Representation for Visual Recognition
Convolutional Neural Networks (CNNs) have revolutionized the understanding of
visual content. This is mainly due to their ability to break down an image into
smaller pieces, extract multi-scale localized features and compose them to
construct highly expressive representations for decision making. However, the
convolution operation is unable to capture long-range dependencies such as
arbitrary relations between pixels since it operates on a fixed-size window.
Therefore, it may not be suitable for discriminating subtle changes (e.g.
fine-grained visual recognition). To this end, our proposed method captures the
high-level long-range dependencies by exploring Graph Convolutional Networks
(GCNs), which aggregate information by establishing relationships among
multi-scale hierarchical regions. These regions consist of smaller (closer
look) to larger (far look), and the dependency between regions is modeled by an
innovative attention-driven message propagation, guided by the graph structure
to emphasize the neighborhoods of a given region. Our approach is simple yet
extremely effective in solving both the fine-grained and generic visual
classification problems. It outperforms the state of the art by a
significant margin on three datasets and is very competitive on the other two.
Comment: Accepted in the 32nd British Machine Vision Conference (BMVC) 202
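As a rough illustration of attention-driven message propagation over a region graph, the sketch below updates each region's feature as an attention-weighted mix of its graph neighbours. It is a minimal sketch under stated assumptions, not the method of the paper: the dot-product scoring, the function names (`softmax`, `dot`, `propagate`) and the toy inputs are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def propagate(features, adjacency):
    """One round of attention-weighted message passing.

    features:  list of region feature vectors (hypothetical toy inputs)
    adjacency: square 0/1 matrix; adjacency[i][j] = 1 means region j
               is a neighbour of region i (self-loops allowed)
    Assumption: attention scores are plain dot products between regions.
    """
    updated = []
    for i, x_i in enumerate(features):
        neighbors = [j for j, linked in enumerate(adjacency[i]) if linked]
        if not neighbors:
            # An isolated region keeps its own feature unchanged.
            updated.append(list(x_i))
            continue
        # Attention is normalised only over the graph neighbourhood.
        alphas = softmax([dot(x_i, features[j]) for j in neighbors])
        mixed = [
            sum(a * features[j][k] for a, j in zip(alphas, neighbors))
            for k in range(len(x_i))
        ]
        updated.append(mixed)
    return updated
```

Restricting the softmax to each region's neighbours is what lets the graph structure guide the attention; a fully connected adjacency would reduce this to ordinary self-attention over all regions.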
Coarse Temporal Attention Network (CTA-Net) for Driver’s Activity Recognition
There has been significant progress in recognizing traditional human activities
from videos focusing on highly distinctive actions involving discriminative
body movements, body-object and/or human-human interactions. Driver's
activities are different since they are executed by the same subject with
similar body-part movements, resulting in subtle changes. To address this, we
propose a novel framework by exploiting the spatiotemporal attention to model
the subtle changes. Our model is named Coarse Temporal Attention Network
(CTA-Net), in which coarse temporal branches are introduced in a trainable
glimpse network. The goal is to allow the glimpse to capture high-level
temporal relationships, such as 'during', 'before' and 'after' by focusing on a
specific part of a video. These branches also respect the topology of the
temporal dynamics in the video, ensuring that different branches learn
meaningful spatial and temporal changes. The model then uses an innovative
attention mechanism to generate high-level action specific contextual
information for activity recognition by exploring the hidden states of an LSTM.
The attention mechanism helps in learning to decide the importance of each
hidden state for the recognition task by weighing them when constructing the
representation of the video. Our approach is evaluated on four publicly
accessible datasets and outperforms the state of the art by a
considerable margin with only RGB video as input.
Comment: Extended version of the accepted WACV 202
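The idea of weighing LSTM hidden states when building a video representation can be sketched as attention pooling. This is a minimal sketch, assuming a single learned scoring vector `w` and toy hidden states; it is not CTA-Net's actual attention mechanism, and the names `softmax` and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden_states, w):
    """Pool per-step hidden states into one vector.

    hidden_states: list of T vectors, one per time step (toy stand-ins
                   for LSTM hidden states)
    w:             scoring vector (assumed to be learned elsewhere)
    Returns the attention weights and the weighted sum of hidden states.
    """
    # One scalar importance score per hidden state.
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(a * h[k] for a, h in zip(alphas, hidden_states))
        for k in range(dim)
    ]
    return alphas, pooled
```

With a zero scoring vector every step receives equal weight and the result is the plain average of the hidden states; training `w` lets the model emphasise the steps that matter for the recognition task.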
SR-GNN: Spatial Relation-aware Graph Neural Network for Fine-Grained Image Categorization
Over the past few years, significant progress has been made in deep
convolutional neural network (CNN)-based image recognition. This is mainly
due to the strong ability of such networks in mining discriminative object pose
and parts information from texture and shape. This is often inappropriate for
fine-grained visual classification (FGVC) since it exhibits high intra-class
and low inter-class variances due to occlusions, deformation, illuminations,
etc. Thus, an expressive feature representation describing global structural
information is key to characterizing an object/scene. To this end, we propose
a method that effectively captures subtle changes by aggregating context-aware
features from most relevant image-regions and their importance in
discriminating fine-grained categories avoiding the bounding-box and/or
distinguishable part annotations. Our approach is inspired by recent
advances in self-attention and graph neural networks (GNNs) and includes
a simple yet effective relation-aware feature transformation and its
refinement using a context-aware attention mechanism to boost the
discriminability of the transformed feature in an end-to-end learning process.
Our model is evaluated on eight benchmark datasets consisting of fine-grained
objects and human-object interactions. It outperforms the state-of-the-art
approaches by a significant margin in recognition accuracy.
Comment: Accepted manuscript - IEEE Transaction on Image Processin
Attend and Guide (AG-Net): A Keypoints-driven Attention-based Deep Network for Image Recognition
This paper presents a novel keypoints-based attention mechanism for visual
recognition in still images. Deep Convolutional Neural Networks (CNNs) for
recognizing images with distinctive classes have shown great success, but their
performance in discriminating fine-grained changes is not at the same level. We
address this by proposing an end-to-end CNN model, which learns meaningful
features linking fine-grained changes using our novel attention mechanism. It
captures the spatial structures in images by identifying semantic regions (SRs)
and their spatial distributions, which proves to be key to modelling
subtle changes in images. We automatically identify these SRs by grouping the
detected keypoints in a given image. The "usefulness" of these SRs for image
recognition is measured using our innovative attentional mechanism focusing on
parts of the image that are most relevant to a given task. This framework
applies to traditional and fine-grained image recognition tasks and does not
require manually annotated regions (e.g. bounding-box of body parts, objects,
etc.) for learning and prediction. Moreover, the proposed keypoints-driven
attention mechanism can be easily integrated into the existing CNN models. The
framework is evaluated on six diverse benchmark datasets. The model outperforms
the state-of-the-art approaches by a considerable margin on the Distracted
Driver V1 (Acc: 3.39%), Distracted Driver V2 (Acc: 6.58%), Stanford-40 Actions
(mAP: 2.15%), People Playing Musical Instruments (mAP: 16.05%), Food-101 (Acc:
6.30%) and Caltech-256 (Acc: 2.59%) datasets.
Comment: Published in IEEE Transaction on Image Processing 2021, Vol. 30, pp. 3691 - 370
Context-aware Attentional Pooling (CAP) for Fine-grained Visual Classification
Deep convolutional neural networks (CNNs) have shown a strong ability in
mining discriminative object pose and parts information for image recognition.
For fine-grained recognition, context-aware rich feature representation of
object/scene plays a key role since it exhibits a significant variance in the
same subcategory and subtle variance among different subcategories. Finding the
subtle variance that fully characterizes the object/scene is not
straightforward. To address this, we propose a novel context-aware attentional
pooling (CAP) that effectively captures subtle changes via sub-pixel gradients,
and learns to attend to informative integral regions and their importance in
discriminating different subcategories without requiring the bounding-box
and/or distinguishable part annotations. We also introduce a novel feature
encoding by considering the intrinsic consistency between the informativeness
of the integral regions and their spatial structures to capture the semantic
correlation among them. Our approach is simple yet extremely effective and can
be easily applied on top of a standard classification backbone network. We
evaluate our approach using six state-of-the-art (SotA) backbone networks and
eight benchmark datasets. Our method significantly outperforms the SotA
approaches on six datasets and is very competitive on the remaining two.
Comment: Extended version of the accepted paper in 35th AAAI Conference on Artificial Intelligence 202